An open-source toolkit for mining Wikipedia

Authors

  • David N. Milne
  • Ian H. Witten
Abstract

The online encyclopedia Wikipedia is a vast repository of information. For developers and researchers it represents a giant multilingual database of concepts and semantic relations, a promising resource for natural language processing and many other research areas. In this paper we introduce the Wikipedia Miner toolkit: an open-source collection of code that allows researchers and developers to easily integrate Wikipedia's rich semantics into their own applications. We describe how it provides simplified, object-oriented access to Wikipedia's structure and content, how it allows terms and concepts to be compared semantically, and how it can detect Wikipedia topics when they are mentioned in documents. We also describe how it has already been applied to several different research problems. However, the toolkit is not intended to be a complete, polished product; it is instead an entirely open-source project that we hope will continue to evolve.

Similar resources

Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History

We present an open-source toolkit that makes it possible (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but...

Transkribus Python Toolkit

This paper introduces an open source Python toolkit for the Transkribus platform. One part of the toolkit offers a Python client for the Transkribus RESTful interface. The second part offers various Document Understanding tools. The open-source toolkit is freely available through GitHub. Keywords—Transkribus platform, RESTful client, Document Understanding, Conditional Random Fields, Sequential...

Wikipedia Tools for Google Spreadsheets

In this paper, we introduce the Wikipedia Tools for Google Spreadsheets. Google Spreadsheets is part of a free, Web-based software office suite offered by Google within its Google Docs service. It allows users to create and edit spreadsheets online, while collaborating with other users in real time. Wikipedia is a free-access, free-content Internet encyclopedia, whose content and data is availabl...

WikiNetTK - A Tool Kit for Embedding World Knowledge in NLP Applications

WikiNetTK is a Java-based open-source toolkit for facilitating the interaction with and the embedding of world knowledge in NLP applications. For user interaction we provide a visualization component, consisting of graphical and textual browsing tools. This allows the user to inspect the knowledge base to which WikiNetTK is applied. The application-oriented part of the toolkit provides various ...

Wikipedia's Labor Squeeze and its Consequences

Introduction. I. Measuring Wikipedia's Success. II. Threats to Wikipedia. III. Wikipedia's Response to the Vandal and Spammer Threats ...

Journal title:
  • Artif. Intell.

Volume 194  Issue 

Pages  -

Publication date 2013